ABSTRACT


The Armed Services Vocational Aptitude Battery (ASVAB) is a test that
approximately 700,000 students in 12,000 high schools take each year to determine
military  occupation  placement.  Form  Assembly  for  the  ASVAB  refers  to  the  selection 
of  20-35  questions,  known  as  items,  from  an  item  pool  of  approximately  300  items  to 
create  a  paper  and  pencil  test  in  one  of  its  ten  topics.  Previous  research  formulates  form 
assembly  as  an  Integer  Linear  Program  (ILP).  The  current  ASVAB  mostly  uses  a 
Computer  Adaptive  Test  (CAT),  which  estimates  an  examinee’s  ability  after  the 
examinee  answers  each  item  and  selects  the  next  item  based  on  prior  performance.  The 
current CAT-ASVAB implementation does not control the number of items selected from
each subject (taxonomy group) for a test. This thesis introduces ILPs, previously used for
form assembly, that impose taxonomy restrictions and applies them to the CAT-ASVAB.
We  create  four  ILP  variations  and  test  them  against  the  current  method  of  item  selection, 
by  simulating  3,500  examinees  (500  examinees  each  for  seven  given  ability  levels).  The 
results  show  that  all  of  the  ILPs  have  acceptable  solution  times  for  CAT  use,  and 
taxonomy  restrictions  can  be  imposed  while  also  having  more  even  exposure  rates  (the 
number  of  times  an  item  is  administered  divided  by  the  number  of  examinees)  than  the 
current implementation of the CAT-ASVAB. A variation that relaxes most of the binary
variables and constrains the difficulty of each item to be within a predetermined
magnitude of the current ability estimate performs the best in terms of item exposure (for
both under- and over-utilized items) and error between an examinee’s estimated ability
level and actual ability level.




TABLE  OF  CONTENTS 


I. INTRODUCTION
II. BACKGROUND
    A. TEST THEORY
    B. OPTIMIZATION OF FORM ASSEMBLY FOR ASVAB (PAPER AND PENCIL)
    C. CAT-ASVAB
        1. Shadow Test
        2. Taxonomy and Item Exposure Control Research for CAT
III. THE CAT-ASVAB OPTIMIZATION MODELS
    A. SHADOW TEST FORMULATION AND VARIATIONS
    B. ABILITY CALCULATION
IV. RESULTS OF CAT-ASVAB OPTIMIZATION SIMULATIONS
    A. SETUP FOR SIMULATION
    B. RESULTS
V. CONCLUSIONS AND FUTURE RESEARCH
    A. CONCLUSIONS
    B. FUTURE RESEARCH
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST




LIST  OF  FIGURES 


Figure 1: Sample Logistic Function
Figure 2: Exposure Rates
Figure 3: Error Histogram of OM
Figure 4: Error Histogram of KM
Figure 5: Error Histogram of DM
Figure 6: Error Histogram of SM
Figure 7: Error Histogram of SDM
Figure 8: Bias Function
Figure 9: MSE Function




LIST  OF  TABLES 


Table 1: Parameter Settings for Formulations
Table 2: Taxonomy Distribution
Table 3: Solution Times
Table 4: p-values versus OM for Wilcoxon Sign Rank Test




ACKNOWLEDGMENTS 


I  would  like  to  express  my  deepest  gratitude  to  my  advisor,  Professor  Robert  Dell 
for  his  guidance,  time,  and  support  throughout  my  research.  His  efforts  were  invaluable 
in  bringing  this  thesis  to  a  successful  completion. 

I  would  also  like  to  thank  Professor  Johannes  Royset,  my  second  advisor,  for 
taking  the  time  to  provide  his  insight  and  support  for  me  to  complete  this  thesis. 

I  am  also  grateful  to  Iosif  Krass  and  Mary  Pommerich  at  the  Defense  Manpower 
Data  Center  for  offering  me  valuable  resources  necessary  for  me  to  complete  my 
research,  and  to  the  Defense  Manpower  Data  Center  for  giving  me  the  opportunity  to 
work  on  this  topic. 

Finally,  I  would  like  to  thank  my  family  and  friends  for  their  continued 
encouragement.  Without  their  support,  I  almost  certainly  would  not  have  come  this  far. 




EXECUTIVE  SUMMARY 


The Armed Services Vocational Aptitude Battery (ASVAB) is a test that
approximately 700,000 students in 12,000 high schools take each year to determine
military  occupation  placement.  Form  Assembly  for  the  ASVAB  refers  to  the  selection 
of  20-35  questions,  known  as  items,  from  an  item  pool  of  approximately  300  items  to 
create  a  paper  and  pencil  test  in  one  of  its  ten  topics.  ASVAB  form  assembly  has  been 
previously  formulated  as  an  integer  linear  program  (ILP)  with  an  objective  function  that 
minimizes  the  deviation  from  a  predetermined  goal  curve  for  the  test. 

Most  of  the  ASVAB  tests  are  administered  as  a  Computer  Adaptive  Test  (CAT). 
The  CAT  estimates  an  examinee’s  ability  after  the  examinee  answers  each  item  and 
selects  the  next  item  based  on  prior  performance.  Because  the  CAT  is  able  to  determine 
an  examinee’s  ability  level  after  each  question  and  select  future  questions  based  on  this 
estimator,  the  test  length  for  a  CAT  is  shorter  than  a  paper  and  pencil  test.  However,  the 
current  CAT-ASVAB  does  not  control  the  number  of  items  selected  from  each  subject 
(taxonomy  group)  for  a  test.  Therefore,  this  taxonomy  distribution  of  the  items  in  a  test 
can  be  heavily  skewed  toward  a  particular  subject.  A  solution  to  this  problem  is  for  a  test 
to  not  only  select  the  next  item,  but  select  an  entire  test  trajectory  for  the  examinee’s 
current  estimated  ability.  This  is  called  a  shadow  test,  and  this  thesis  combines  a 
shadow  test  with  previously  researched  paper  and  pencil  form  assembly  for  application  to 
the  CAT-ASVAB. 

This  thesis  also  discusses  other  problems  associated  with  the  CAT,  such  as  item 
exposure  control  and  solution  time.  One  method  it  explores  is  item-stratification.  In  this 
method,  the  item  selection  algorithm  divides  the  item  pool  into  groups  according  to  their 
discrimination  parameter  (an  item  with  a  high  discrimination  parameter  is  able  to  separate 
examinees  with  nearly  the  same  ability,  whereas  a  low  discrimination  parameter  does  not 
separate them as well) and divides the test into an equal number of stages. The purpose is
to select items with a lower discrimination (and therefore lower information value)
toward  the  beginning  of  a  test,  and  leave  items  with  a  higher  discrimination  (and  higher 
information  value)  until  the  end  when  the  ability  estimate  is  more  accurate. 

There are five variations of CAT-ASVAB item selection considered in this thesis:
1) a previously researched paper and pencil form assembly method for the ASVAB
(KM); 2) KM that constrains the difficulty parameter (a parameter that measures the
difficulty of an item) to be within a certain amount of the current ability level of the
examinee (DM); 3) KM with the addition of item-stratification constraints (SM); 4)
KM that has both difficulty parameter constraints and item-stratification constraints
(SDM); and 5) the current item selection method of the CAT-ASVAB (OM), which serves
as a benchmark for the other four. Each of the five variations of the model is
examined using 3,500 artificially generated examinees (500 examinees each for seven
given ability levels). Aside from SM and SDM having a high maximum exposure rate,
our results indicate that all of the shadow test variations have more even exposure rates
than the current implementation of the CAT-ASVAB, with significantly fewer unutilized
items. DM performs the best in terms of item exposure (for both under and over-utilized
items)  and  error  between  an  examinee’s  estimated  ability  level  and  actual  ability  level. 
All  of  the  variations  benefit  from  the  ability  to  add  taxonomy  constraints.  Without  the 
taxonomy  constraints,  our  results  suggest  that  the  current  CAT  implementation  has  a 
taxonomy  distribution  heavily  favoring  one  of  the  taxonomy  groups. 
I. INTRODUCTION


Since 1968, all US military applicants have taken the Armed Services Vocational
Aptitude Battery (ASVAB) to determine military occupation placement. Approximately
700,000 students in 12,000 high schools take this test every year [Pommerich 2005].
Form assembly for the ASVAB refers to the selection of multiple choice questions,
known as items, out of a given item pool to create a paper and pencil test in one of its ten
topics. A typical form has 20-35 items selected from an item pool of approximately 300
items. Kunde [1997] formulates form assembly as an integer linear program (ILP) and
solves it both optimally and using heuristics.

In 1997, many ASVAB tests were still commonly administered in their printed
(paper and pencil) form. The ASVAB has since moved toward being a Computer
Adaptive Test (CAT) [e.g., Weiss 2004]. Other tests that use a CAT include the GRE
[e.g., Syvum 2006] and GMAT [e.g., Princeton Review 2006]. The CAT estimates an
examinee’s ability after the examinee answers each item and selects the next item based
on this estimate. This allows it to use fewer items than a paper and pencil exam to
determine an examinee’s ability.

The current CAT-ASVAB item selection algorithm does not take into
account item taxonomy constraints [Sands, Waters, and McBride 1999]. A taxonomy
constraint imposes a limit on the number of items from a given subject (e.g., Addition,
Division, etc.). Veldkamp and van der Linden [2004] use a shadow test to determine the
next question. A shadow test creates a whole test trajectory for the examinee’s current
estimated ability, then chooses the best item within that trajectory to administer. By
creating this whole test, other constraints can be added to the formulation, including
taxonomy constraints.

This thesis extends the ILP from Kunde [1997] for use as a shadow test and
applies it to item selection for a CAT-ASVAB. The primary extensions speed solution
time and control item exposure. Item exposure control refers to limiting the number of
times a test administers an item to a set of examinees. Too many examinees receiving the
same item increases the likelihood of a future examinee having advance knowledge of
an item.

II. BACKGROUND


A.  TEST  THEORY 

The ASVAB uses Item Response Theory (IRT) to measure the precision of each
test. An examinee’s ability level is denoted as $\theta$. It is assumed that $\theta$ follows a
standard normal distribution (mean of zero and a standard deviation of one). The range of
$\theta$ is commonly set between -3.0 and 3.0 or between -2.5 and 2.5 [Sands, Waters, and
McBride 1999]. In IRT, the probability of an examinee with ability level $\theta$ answering
an item correctly is calculated with the three parameter logistic function shown below
[Lord 1980]:

$$p(\theta) = c + \frac{1 - c}{1 + e^{-Da(\theta - b)}}$$

Figure 1: Sample Logistic Function (vertical axis: Probability of Correct Answer)
In  the  above  sample,  the  discrimination  parameter:  a=2.24,  the 
difficulty  parameter:  b=0J2,  and  the  guessing  parameter:  c=0.4 

The 3 parameters are a, b, and c, with D being a scaling factor. The a parameter
is the discrimination of the item. This is the capability of the item to distinguish between
applicants of different abilities. In Figure 1, the a parameter is proportional to the slope
of the logistic function at its inflection point. The steeper the slope, the greater the
difference in the probability of a correct response between examinees with different
ability levels; a flatter slope means examinees with different ability levels have more
similar probabilities of a correct response. The b parameter measures the difficulty of an
item. In Figure 1, the b parameter determines the position of the curve’s inflection point
along the $\theta$-axis. Finally, parameter c is the guessing parameter. This is the
probability of a person with a low ability level guessing the item correctly. This parameter
shows up in Figure 1 as the lower asymptotic bound on the $p(\theta)$ axis. These
parameters are typically calculated after the item has been pretested 1,000 to 10,000
times. From here, the item information function can be derived from $p(\theta)$ [Lord 1980]:


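$$I_i(\theta) = \frac{[p'(\theta)]^2}{p(\theta)\,[1 - p(\theta)]}$$

or

$$I_i(\theta) = \frac{D^2 a^2 (1 - c)}{\left(c + e^{Da(\theta - b)}\right)\left(1 + e^{-Da(\theta - b)}\right)^2}$$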


where $p'$ is the derivative of $p$. The presence of the derivative in the numerator
indicates that items with a higher discrimination parameter have a higher information
value. Because the information contribution of each item is assumed to be independent
of the other items in the ASVAB, the item information functions can be added together to
produce an overall information curve. With N being the number of items in the form, the
exam information function is [Lord 1980]:

$$I(\theta) = \sum_{i=1}^{N} I_i(\theta)$$


This  function  measures  the  precision  of  the  exam  in  estimating  an  examinee’s  true 
ability  level.  The  next  section  shows  how  the  above  information  function  is  applicable  to 
form  assembly. 

B. OPTIMIZATION OF FORM ASSEMBLY FOR ASVAB (PAPER AND PENCIL)

This section describes Kunde’s paper and pencil formulation, which is used in the
optimization model in this thesis for the CAT-ASVAB. Kunde’s formulation has two
goals expressed in the objective function. The first is to minimize the difference between
the information of the exam and the information from a goal curve. A goal curve is a test
information function like the one introduced in the previous section that represents the
desired information distribution of the exam across the ability levels. It is produced from
empirical research and testing. The deviations between an assembled form and the goal
curve for specific values of $\theta$ are organized by their magnitude into groups, which are
denoted in the formulation below by the index g. Each group is assigned a penalty per
unit of deviation. Higher deviations from the goal curve receive a higher penalty per unit
deviation.

For  security  purposes,  alternate  forms  are  created  for  an  exam  (denoted  by  the 
index  f).  This  leads  to  the  second  goal  of  the  formulation:  to  make  each  form  as  similar 
as  possible  in  information.  The  second  component  of  the  objective  function  seeks  to 
minimize  the  deviations  of  each  form  from  the  first  reference  form. 

Below is Kunde’s integer linear program formulation for the paper and pencil form.

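Indices:
i    item from the item pool;
$\theta$    ability level;
f    form to be assembled (1, 2, ..., F);
t    taxonomy (1, 2, ..., T);
g    penalty group

Sets:
$\mathit{TaxItems}_t$    The set of items in taxonomy group t

Data:
$\mathit{CAT}_g$    The maximum deviation between a form and the goal curve in group g
$I_{i\theta}$    Information value of item i at percentile $\theta$
$\mathit{NITEM}_t$    The required number of items in taxonomy t
$\mathit{PARAWEI}$    Weight that combines the two goals
$\mathit{PENALTY}_g$    Penalty per unit deviation within group g
$\mathit{SHAPE}_\theta$    The information value for the goal curve at percentile $\theta$

Variables:
$x_{if}$    One, if item i is used in form f
$py_{\theta fg}$    Deviation above the goal curve in group g at percentile $\theta$ on form f
$ny_{\theta fg}$    Deviation below the goal curve in group g at percentile $\theta$ on form f
$\mathit{delplus}_f$    The total information form one contains that exceeds form f
$\mathit{delneg}_f$    The total information form f contains that exceeds form one

Formulation:

$$\min \; \sum_{\theta} \sum_{f} \sum_{g} \mathit{PENALTY}_g \left( py_{\theta fg} + ny_{\theta fg} \right) + \mathit{PARAWEI} \sum_{f>1} \left( \mathit{delplus}_f + \mathit{delneg}_f \right) \tag{k1}$$

such that

$$\sum_{g} py_{\theta fg} \ge \sum_{i} I_{i\theta}\, x_{if} - \mathit{SHAPE}_\theta \quad \forall \theta, f \tag{k2}$$

$$\sum_{g} ny_{\theta fg} \ge -\sum_{i} I_{i\theta}\, x_{if} + \mathit{SHAPE}_\theta \quad \forall \theta, f \tag{k3}$$

$$\sum_{i \in \mathit{TaxItems}_t} x_{if} = \mathit{NITEM}_t \quad \forall t, f \tag{k4}$$

$$\sum_{f} x_{if} \le 1 \quad \forall i \tag{k5}$$

$$\sum_{i} \sum_{\theta} I_{i\theta}\, x_{i1} - \sum_{i} \sum_{\theta} I_{i\theta}\, x_{if} = \mathit{delplus}_f - \mathit{delneg}_f \quad \forall f > 1 \tag{k6}$$

$$0 \le py_{\theta fg} \le \mathit{CAT}_g \quad \forall \theta, f, g \tag{k7}$$

$$0 \le ny_{\theta fg} \le \mathit{CAT}_g \quad \forall \theta, f, g \tag{k8}$$

$$x_{if} \ \text{binary} \quad \forall i, f \tag{k9}$$

$$\mathit{delplus}_f,\ \mathit{delneg}_f \ge 0 \quad \forall f \tag{k10}$$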
The first component in the objective function (k1), corresponding to the first goal
of minimizing deviation from the goal curve,
$\sum_{\theta} \sum_{f} \sum_{g} \mathit{PENALTY}_g (py_{\theta fg} + ny_{\theta fg})$,
expresses the vertical deviation from the goal curve. The variables $py_{\theta fg}$ and
$ny_{\theta fg}$ are the positive and negative deviations, respectively, of form f from the
goal curve, in group g, for ability $\theta$. In the second component of the objective function,
$\mathit{PARAWEI} \sum_{f>1} (\mathit{delplus}_f + \mathit{delneg}_f)$, the variable
$\mathit{delplus}_f$ is the total form one information in excess of form f, while
$\mathit{delneg}_f$ is the total form f information in excess of form one.

Constraint sets (k2) and (k3) give the values for the positive and negative
deviations  of  the  information  function  from  the  goal  curve.  Set  (k4)  specifies  the  number 
of  items  in  a  form  from  a  given  taxonomy.  Set  (k5)  states  that  item  i  can  only  appear  in 
at  most  one  form.  Set  (k6)  gives  the  total  difference  in  information  between  the  forms, 
and  sets  (k7)  and  (k8)  bound  the  deviations  of  the  information  function  from  the  goal 
curve. 
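
To illustrate how the penalty groups behave for a fixed form, the sketch below evaluates the first objective component at a few ability levels. It is only an illustration under stated assumptions: the caps and penalties are hypothetical stand-ins for the group limits and per-unit penalties, penalties are assumed to increase with the group index, and no optimization over items is performed.

```python
def deviation_penalty(deviation, caps, penalties):
    """Cheapest way to split |deviation| across penalty groups.

    caps[g] plays the role of the group limit and penalties[g] the per-unit
    penalty. Because penalties are assumed to increase with g, filling the
    groups in order is the minimum-cost allocation.
    """
    remaining = abs(deviation)
    cost = 0.0
    for cap, penalty in zip(caps, penalties):
        used = min(remaining, cap)
        cost += penalty * used
        remaining -= used
    if remaining > 1e-12:
        raise ValueError("deviation exceeds the total capacity of all groups")
    return cost

def goal_curve_penalty(form_info, shape, caps, penalties):
    """First objective component for one form, summed over ability levels."""
    return sum(deviation_penalty(form_info[t] - shape[t], caps, penalties)
               for t in shape)

# Hypothetical data: three ability levels and three penalty groups.
caps = [1.0, 2.0, 10.0]
penalties = [1.0, 3.0, 9.0]
shape = {-1: 4.0, 0: 4.0, 1: 4.0}       # goal curve values
form_info = {-1: 4.0, 0: 6.5, 1: 3.5}   # information of a candidate form
print(goal_curve_penalty(form_info, shape, caps, penalties))
```

A deviation that spills into a higher group is charged at that group's higher rate, which is how the formulation discourages a few large departures from the goal curve more strongly than many small ones.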


C.  CAT-ASVAB 

The formulation above optimizes the objective function across all values of $\theta$ and
creates a form that satisfies a set of specified attributes (e.g., length and taxonomy). In a
CAT, the examinee’s current performance on the exam determines each item that is
administered. Therefore, at a given point in an exam, an individual with a higher estimated
ability level receives a more difficult item than an individual with a lower estimated
ability. Because the examinee receives an item based on his estimated ability, the exam can
produce a better estimate of the examinee’s ability in fewer questions. As currently
implemented, all examinees start with the same average ability level estimate,
$\hat{\theta}_0 = 0$. The CAT-ASVAB uses Owen’s Bayesian algorithm to calculate the
ability after each item is answered. Because the order of items administered affects the
ability calculation, an additional Bayesian modal calculation is used to calculate $\theta$ at
the end of the test. Currently, the item selection algorithm for the CAT-ASVAB seeks to
maximize the item information function at the examinee’s current $\theta$ and limit item
exposure. The information values are looked up in a table indexed by $\theta$ [Sands,
Waters, and McBride 1999].

1.  Shadow  Test 

One method proposed to deal with the taxonomy constraints is a shadow test [e.g.,
van der Linden and Veldkamp 1998]. Instead of merely calculating the best item to
administer at the current $\theta$, a whole test trajectory is constructed for the examinee at
the current $\theta$. The indices used in the formulation below are the same as in Kunde’s
formulation with the addition of an index h, a quantitative attribute group. An example
of a quantitative attribute group is one in which the total word count for all items in the
group must stay within pre-specified bounds. Thus, a possible constraint would be to limit
the total word count for a set of items in each group h. This is represented by the
following constraints:

$$\sum_{i \in Q_h} L_{ih}\, x_i \le \mathit{UH}_h \quad \forall h$$

$$\sum_{i \in Q_h} L_{ih}\, x_i \ge \mathit{LH}_h \quad \forall h,$$

where $L_{ih}$, in this example, is the word count for item i, $\mathit{UH}_h$ and
$\mathit{LH}_h$ are an upper and lower bound respectively on the sum of the word counts
for all items in group h, and $Q_h$ is the set of items in group h. Below is Veldkamp and
van der Linden’s formulation using notation consistent with Kunde’s formulation above.

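Indices:
k    iteration count where examinee is given his kth question
h    quantitative attribute group

Sets:
$\mathit{Fix}$    The set of items already administered
$Q_h$    The set of items in quantitative attribute group h

Data:
$\hat{\theta}_{k-1}$    Current ability estimate after k-1 items have been administered
$L_{ih}$    Quantitative attribute for item i for attribute group h
$\mathit{UH}_h$    Upper bound for the quantitative attribute total in group h
$\mathit{LH}_h$    Lower bound for the quantitative attribute total in group h
$\mathit{UT}_t$    Upper bound for number of items in taxonomy t
$\mathit{LT}_t$    Lower bound for number of items in taxonomy t
$I_i(\theta)$    The item information value at $\theta$
N    The number of items in the test

Decision Variable:
$x_i$    One, if item i is used in the shadow test

Formulation:

$$\max \; \sum_{i} I_i(\hat{\theta}_{k-1})\, x_i \tag{v1}$$

such that

$$x_i = 1 \quad \forall i \in \mathit{Fix} \tag{v2}$$

$$\sum_{i \in \mathit{TaxItems}_t} x_i \le \mathit{UT}_t \quad \forall t \tag{v3}$$

$$\sum_{i \in \mathit{TaxItems}_t} x_i \ge \mathit{LT}_t \quad \forall t \tag{v4}$$

$$\sum_{i \in Q_h} L_{ih}\, x_i \le \mathit{UH}_h \quad \forall h \tag{v5}$$

$$\sum_{i \in Q_h} L_{ih}\, x_i \ge \mathit{LH}_h \quad \forall h \tag{v6}$$

$$\sum_{i} x_i = N \tag{v7}$$

$$x_i \ \text{binary} \quad \forall i \tag{v8}$$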

The model selects the item with the greatest information at the current ability
estimate, $\hat{\theta}_{k-1}$, from the items in the shadow test that have not already been
administered. Constraint set (v2) sets $x_i$ to 1 for the items i that have already been
administered. Constraint sets (v3) and (v4) are taxonomy constraints and set an upper and
lower limit respectively on the number of items administered from each taxonomy group.
Constraint sets (v5) and (v6) are the above mentioned quantitative attribute constraints.
“Because each shadow test meets the constraints, the adaptive test automatically meets
them” [van der Linden and Veldkamp 2004].
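
The shadow test idea can be sketched on a very small pool by enumeration. This is an illustrative stand-in for the ILP, not the method used in the cited work: it enumerates all candidate tests instead of calling a solver, it omits the quantitative attribute constraints, and the item pool, information values, and bounds are hypothetical.

```python
from itertools import combinations

def shadow_test(info, taxonomy, administered, length, lower, upper):
    """Return the most informative feasible test (tuple of item ids).

    info[i]       information of item i at the current ability estimate
    taxonomy[i]   taxonomy group of item i
    administered  items already given; always kept in the shadow test
    lower, upper  per-taxonomy bounds on the number of items
    """
    free = [i for i in info if i not in administered]
    best, best_value = None, float("-inf")
    for extra in combinations(free, length - len(administered)):
        test = tuple(administered) + extra
        counts = {}
        for i in test:
            counts[taxonomy[i]] = counts.get(taxonomy[i], 0) + 1
        feasible = all(lower[t] <= counts.get(t, 0) <= upper[t] for t in lower)
        value = sum(info[i] for i in test)
        if feasible and value > best_value:
            best, best_value = test, value
    return best

def next_item(info, shadow, administered):
    """Most informative item of the shadow test not yet administered."""
    return max((i for i in shadow if i not in administered), key=info.get)

# Hypothetical six-item pool with two taxonomy groups.
info = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.3, 5: 0.2, 6: 0.1}
taxonomy = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
bounds = {"A": 2, "B": 2}
shadow = shadow_test(info, taxonomy, [6], 4, bounds, bounds)
print(shadow, next_item(info, shadow, [6]))
```

Without the taxonomy bounds, the three most informative items would all come from group A; with them, the shadow test is forced to include a second group B item, and the administered test inherits that balance.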

2.  Taxonomy  and  Item  Exposure  Control  Research  for  CAT 

Much  research  has  been  done  on  different  ways  to  implement  CAT.  Because  one 
of  the  main  concerns  with  CAT  is  item  exposure  control,  many  papers  written  about  CAT 
implementation  discuss  possible  solutions  for  this  issue.  The  CAT-ASVAB  currently 
uses  Sympson  and  Hetter’s  [1985]  algorithm  to  control  item  exposure.  This  thesis  uses 
this  algorithm  for  its  optimization  model  as  well.  The  Sympson  and  Hetter  algorithm 
assigns  a  number  between  zero  and  one,  called  the  item  exposure  parameter,  to  each  item. 
A  pretest  simulation  determines  these  parameters.  Items  with  a  higher  exposure  rate  at 
the  end  of  the  simulation  receive  a  lower  exposure  parameter.  During  the  actual  test, 
when  the  test  selects  an  item,  it  generates  a  random  number  uniformly  distributed 
between  zero  and  one.  If  the  item  exposure  parameter  of  this  item  is  less  than  the  random 
number,  the  test  rejects  the  item  and  selects  the  item  with  the  next  highest  information 
value,  and  so  on. 
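
The acceptance step described above can be sketched as follows. This is a minimal illustration, assuming the exposure parameters are already available from a pretest simulation; the function and argument names are hypothetical, and the pretest simulation itself is not shown.

```python
import random

def select_with_exposure_control(candidates, info, exposure, rng=random):
    """Pick an item by information, filtered by exposure parameters.

    candidates   item ids eligible at this point in the test
    info[i]      information of item i at the current ability estimate
    exposure[i]  exposure parameter in [0, 1] from the pretest simulation
    Returns (selected item or None, list of rejected items).
    """
    rejected = []
    for item in sorted(candidates, key=info.get, reverse=True):
        if rng.random() <= exposure[item]:
            return item, rejected
        rejected.append(item)  # parameter is less than the random number
    return None, rejected

# Hypothetical three-item example.
info = {"a": 0.9, "b": 0.5, "c": 0.1}
exposure = {"a": 0.3, "b": 0.8, "c": 1.0}
print(select_with_exposure_control(["a", "b", "c"], info, exposure))
```

An item with exposure parameter 1.0 is always accepted when reached, while a highly informative item with a low parameter is frequently passed over in favor of the next most informative one.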

Another  technique  to  control  item  exposure  is  called  5-4-3-2-1  [Sympson  and 
Hetter  1997].  The  first  item  is  chosen  randomly  out  of  the  five  most  informative  items. 
The  next  item  is  then  chosen  randomly  out  of  the  four  most  informative,  and  so  on  until  it 
is  choosing  from  one  item.  Afterwards,  the  procedure  starts  over  again  at  five  items. 
Another  randomization  technique  is  to  choose  one  item  out  of  three,  then  disqualify  the 
other  two  from  further  administration  [Thomasson  1998].  Another  technique  does  not 
use  the  item  information  value,  but  randomly  selects  from  items  within  a  specified 
distance  from  a  target  difficulty  level  [Lunz  and  Stahl  1998]. 
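
The 5-4-3-2-1 technique can be sketched as below. This is an illustration only: the 0-based position argument and the function name are assumptions made for the example.

```python
import random

def five_four_three_two_one(position, candidates, info, rng=random):
    """Randomized selection for the item at a 0-based test position.

    The group size cycles 5, 4, 3, 2, 1, 5, 4, ... and the item is drawn
    uniformly from that many of the most informative candidates.
    """
    group_size = 5 - (position % 5)
    ranked = sorted(candidates, key=info.get, reverse=True)
    return rng.choice(ranked[:group_size])

# Hypothetical pool of ten items with decreasing information.
info = {i: 1.0 / (i + 1) for i in range(10)}
print([five_four_three_two_one(k, list(range(10)), info) for k in range(6)])
```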

Other methods require a more significant change in item or test structure to
address item exposure control. One method is item stratification, and this thesis also
includes this method in its optimization model. Items fall into n groups called strata by
their a parameters, and exams divide into n stages. For a model with taxonomy
constraints, this first categorizes the items by their taxonomy before sorting the items
within each taxonomy by the a parameter. It then divides the items in each taxonomy
into n groups. Items from the first group in each taxonomy go into the first stratum, items
from the second group go into the second stratum, and so on until there are n strata. During
the nth stage, the test selects an item from the nth stratum [Leung, Chang, and Hau 2003].
Item stratification selects items with a lower discrimination value near the beginning of
the test. Because items with a higher discrimination also carry higher information values,
item stratification is contrary to the typical approach of selecting the item with the highest
information value. Item stratification reserves the items that carry more information
for the end of the exam, where the ability estimate is closer to the true ability. In a
study done by Chang and van der Linden [2003], item stratification yields more even
exposure rates throughout the items, thus having fewer underexposed and overexposed
items. Below is the formulation of the item stratification model as a shadow test [Chang
and van der Linden 2003]. The indices are the same as the shadow test formulation given
in the previous section, with the addition of the index r, the stratum.

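Indices:
r    stratum;

Sets:
$S_r$    The set of items at the stratum r when selecting item k

Data:
$n_r$    The required number of items from stratum r
$b_i$    Difficulty of item i (standard deviations from $\theta = 0$)

Variables:
y    Deviation of item’s difficulty parameter from $\hat{\theta}_{k-1}$

Formulation:

$$\min \; y \tag{c1}$$

such that

$$(b_i - \hat{\theta}_{k-1})\, x_i \le y \quad \forall i \tag{c2}$$

$$(b_i - \hat{\theta}_{k-1})\, x_i \ge -y \quad \forall i \tag{c3}$$

$$x_i = 1 \quad \forall i \in \mathit{Fix} \tag{c4}$$

$$\sum_{i \in S_r} x_i = n_r \quad \forall r \tag{c5}$$

$$\sum_{i \in \mathit{TaxItems}_t} x_i \le \mathit{UT}_t \quad \forall t \tag{c6}$$

$$\sum_{i \in \mathit{TaxItems}_t} x_i \ge \mathit{LT}_t \quad \forall t \tag{c7}$$

$$\sum_{i \in Q_h} L_{ih}\, x_i \le \mathit{UH}_h \quad \forall h \tag{c8}$$

$$\sum_{i \in Q_h} L_{ih}\, x_i \ge \mathit{LH}_h \quad \forall h \tag{c9}$$

$$y \ge 0 \tag{c10}$$

$$x_i \ \text{binary} \quad \forall i \tag{c11}$$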

Items with a difficulty parameter closest to the current estimate of ability,
$\hat{\theta}_{k-1}$, are chosen within the given constraints. Constraint set (c5) specifies
the number of items that must come from each stratum. The rest of the constraints are the
same as the shadow test.
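
The construction of the strata described earlier in this section (sorting by the a parameter within each taxonomy, then splitting into n groups) can be sketched as follows. The item tuples are hypothetical, and the rule used to split a taxonomy whose size is not a multiple of n is an assumption made for the example.

```python
def stratify(items, n_strata):
    """Assign items to strata by a parameter within each taxonomy.

    items is a list of (item_id, taxonomy, a) tuples. Returns a list of
    n_strata lists; stratum 0 holds the least discriminating items.
    """
    by_taxonomy = {}
    for item_id, tax, a in items:
        by_taxonomy.setdefault(tax, []).append((a, item_id))
    strata = [[] for _ in range(n_strata)]
    for members in by_taxonomy.values():
        members.sort()  # ascending a within the taxonomy
        for rank, (_, item_id) in enumerate(members):
            # Split the sorted list into n_strata nearly equal groups.
            strata[rank * n_strata // len(members)].append(item_id)
    return strata

# Hypothetical pool: six items in taxonomy A and three in taxonomy B.
items = [(1, "A", 0.5), (2, "A", 1.5), (3, "A", 1.0), (4, "A", 2.0),
         (5, "A", 0.8), (6, "A", 1.2), (7, "B", 0.9), (8, "B", 0.4),
         (9, "B", 1.6)]
print(stratify(items, 3))
```

Each stratum then contains items from every taxonomy, so taxonomy constraints remain satisfiable at every stage of the test.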


Another method, the Computerized Adaptive Sequential Test (CAST), partitions
the test into a collection of subtests such that these subtests become the units of test
administration instead of items [Davis and Dodd 2003]. This method groups the items
into subtests called modules and places them in multistaged panels. There are two ways
to construct the panels. The first is bottom-up construction that assembles the items into
modules such that each module, as a self-contained unit, meets the requisite information,
content, and item feature targets selected for the test [Davis and Dodd 2003]. The second
method of panel construction is top-down, where any module path through the panel
results in a test of appropriate precision, content, and item type [Davis and Dodd 2003].
The method used in Davis and Dodd’s study is the bottom-up construction. With the
exception of the first stage, the test segregates the modules by difficulty level in each
stage. The first stage has only one module. A typical allocation for the other stages
would place three modules in each of the second and third stages, with the modules
corresponding to low, medium, and high difficulty. A panel is randomly assigned to an
examinee at the beginning. From there, at the first stage, the examinee receives a module.
When the examinee completes the module, the test calculates his ability, and in the next
stage, it bases the next module the examinee receives on his current estimated ability. An
examinee can only move by one difficulty level between stages. For example, one cannot
receive an easy module after completing a hard module the stage before. Like
a-stratification, this method also yielded more even exposure rates [Davis and Dodd 2003].

Two of the methods mentioned above for item exposure control, the Sympson and
Hetter algorithm and item stratification, are incorporated into the optimization model for
this thesis, as well as alternate forms from the paper and pencil exam. Shadow tests in the
existing research use maximum information or minimum difficulty deviation
as objective functions. The formulation in the following section, however, uses the
deviation from a goal curve as in Kunde’s paper and pencil formulation for the objective
function.

III. THE CAT-ASVAB OPTIMIZATION MODELS


A.  SHADOW  TEST  FORMULATION  AND  VARIATIONS 

The  integer  linear  program  (ILP)  in  this  thesis  uses  Kunde’s  formulation  as  a 
starting point and adapts it for use in the CAT-ASVAB as a shadow test. In his paper and
pencil formulation, Kunde uses alternative forms as a means of test security. This
shadow  test  formulation  retains  the  alternative  forms  as  a  means  of  item  exposure  control 
along  with  the  Sympson-Hetter  method.  For  this  thesis,  the  test  creates  two  forms,  with 
15  items  each,  for  each  shadow  test.  An  examinee  starts  off  on  one  of  the  forms.  Each 
item  selected  first  goes  through  the  Sympson-Hetter  algorithm.  If  the  algorithm  rejects 
an  item,  the  test  administers  the  item  with  the  most  information  from  the  alternative  form. 
The  test  does  not  use  the  rejected  items  again  for  the  remainder  of  the  exam.  If  the 
Sympson-Hetter  algorithm  also  rejects  the  item  from  the  alternative  form,  the  test  goes 
back  and  selects  the  next  most  informative  item  from  the  first  form,  and  so  on.  If  the 
items  in  the  shadow  tests  to  choose  from  run  out,  the  test  reruns  the  model  to  obtain  a 
new  shadow  test. 
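
The selection procedure described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `next_item`, the data layout, and the use of a per-item acceptance probability as the Sympson-Hetter exposure control parameter are assumptions made for the example.

```python
import random


def next_item(forms, start, exposure_k, rejected, rng):
    """Pick the next item, switching forms after every rejection.

    forms      : two lists of (item_id, information) for unadministered items
                 (hypothetical layout)
    start      : index (0 or 1) of the form the examinee is currently on
    exposure_k : item_id -> assumed Sympson-Hetter acceptance probability
    rejected   : set of item ids already rejected for this examinee (updated)
    Returns an item id, or None when both forms are exhausted, in which case
    the caller would re-solve the model to obtain a new shadow test.
    """
    # Most informative first, skipping items rejected earlier in the exam.
    queues = [
        sorted((it for it in form if it[0] not in rejected),
               key=lambda it: it[1], reverse=True)
        for form in forms
    ]
    current = start
    while queues[0] or queues[1]:
        if not queues[current]:
            current = 1 - current
        item_id, _ = queues[current].pop(0)
        if rng.random() < exposure_k[item_id]:
            return item_id          # accepted: administer this item
        rejected.add(item_id)       # never reuse a rejected item
        current = 1 - current       # try the alternative form next
    return None
```

An item with an acceptance probability of 1 is always administered when it is reached, which is why very high exposure control parameters offer little protection against over-exposure.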

As mentioned earlier, the solution time of the shadow test is critical. To speed up
solution times, this formulation relaxes Kunde’s ILP such that only the $x_{if}$ value for the
current item needs to be binary, while the rest of the values can be continuous.
Allowing continuous variables could decrease overall solution quality, but we did not
observe any substantial differences. For the relaxation, the formulation splits $x_{if}$ into a
binary and a continuous component, $xb_{if}$ and $xc_{if}$, respectively. Therefore the constraint
set from the original formulation:

$x_{if}$ binary $\quad \forall i, f$

is replaced with the constraint sets below.

$0 \le x_{if} \le 1 \quad \forall i, f$

$0 \le xc_{if} \le 1$

$xb_{if}$ binary

$x_{if} = xb_{if} + xc_{if}$

To specify that at least one $x_{if}$, other than the administered items, is an integer, the
following constraint is added.

$\sum_{i} xb_{if} \ge \sum_{i \in Fix} x_{if} + 1 \quad \forall f$
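
As a small illustration of the relaxation, the sketch below checks one form's candidate values against the variable constraints described in the text: the binary and continuous components, their bounds, their sum, and the requirement that at least one item outside the administered set takes its value from the binary component. The function name and dictionary layout are hypothetical, and the last check follows the prose statement of the requirement rather than any particular algebraic form.

```python
def satisfies_relaxation(x, xb, xc, fixed, tol=1e-9):
    """Check one form's values against the relaxed variable constraints.

    x, xb, xc : dicts item_id -> value for a single form (hypothetical layout)
    fixed     : set of item ids already administered (the set Fix)
    """
    for i in x:
        if xb[i] not in (0, 1):                 # binary component
            return False
        if not -tol <= xc[i] <= 1 + tol:        # 0 <= xc <= 1
            return False
        if not -tol <= x[i] <= 1 + tol:         # 0 <= x <= 1
            return False
        if abs(x[i] - (xb[i] + xc[i])) > tol:   # x = xb + xc
            return False
    # At least one item other than the administered ones is integer-valued.
    return any(xb[i] == 1 for i in x if i not in fixed)
```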

Kunde’s formulation, along with the addition of the above constraints, establishes
the base model for this thesis (KM). For this thesis, we develop three other variations for
comparison. One variation (DM) comes from the observation that items administered
with a higher deviation between the b parameter and current ability estimate tend to have
a smaller effect on the ability estimate. For example, if an individual correctly answered an
item whose difficulty parameter was far below his current ability, it would
barely affect the new ability estimate. Therefore, for this variation, the two constraints
below are added to constrain the difficulty parameter to be within a given number, BLIM,
of the current ability estimate.

$(b_i - \hat{\theta})\, x_{if} \le BLIM \quad \forall i \notin Fix, f$

$(b_i - \hat{\theta})\, x_{if} \ge -BLIM \quad \forall i \notin Fix, f$

where $b_i$ is the difficulty parameter of item $i$ and $\hat{\theta}$ is the current ability estimate.
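
One way to see the observation behind DM is through item information. The sketch below assumes a three-parameter logistic (3PL) item response model with the scaling factor D = 1.7 listed in Table 1; the 3PL assumption, the function names, and the item dictionaries are illustrative choices rather than details taken from the formulation. It shows that an item whose difficulty is far from the ability estimate carries little information, and it filters a pool by BLIM.

```python
import math

D = 1.7  # scaling factor listed in Table 1


def p_correct(theta, a, b, c):
    """3PL probability of a correct response (model assumed for illustration)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def information(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def within_blim(items, theta_hat, blim):
    """Keep items whose difficulty b lies within blim of the ability estimate."""
    return [it for it in items if abs(it["b"] - theta_hat) <= blim]


pool = [{"id": 1, "b": 0.2}, {"id": 2, "b": 2.0}, {"id": 3, "b": -0.4}]
eligible = within_blim(pool, theta_hat=0.0, blim=0.5)  # items 1 and 3
```

With BLIM = 0.5, as in Table 1, only items whose difficulty is within half a unit of the current estimate remain eligible.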

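Using the same notation as Kunde’s formulation and van der Linden’s sample
shadow test, below is the formulation for this variation.

Data:

BLIM         Maximum deviation of item difficulty from current ability

Variables:

$x_{if}$     One, if item i is used in form f
$xc_{if}$    Continuous component of $x_{if}$
$xb_{if}$    Binary component of $x_{if}$

Formulation:

min  $\sum_{\theta} \sum_{f} \sum_{g} PENALTY_g\,(py_{\theta fg} + ny_{\theta fg}) + PARAWEI \sum_{f>1} (delplus_f - delneg_f)$   (d1)

s.t.

$\sum_{g} py_{\theta fg} \ge \sum_{i} INF_{i\theta}\, x_{if} - shape_{\theta} \quad \forall \theta, f$   (d2)

$\sum_{g} ny_{\theta fg} \ge -\sum_{i} INF_{i\theta}\, x_{if} + shape_{\theta} \quad \forall \theta, f$   (d3)

$\sum_{i \in TaxItem_t} x_{if} = NITEM_t \quad \forall f, t$   (d4)

[…] $\quad \forall i$   (d5)

$(b_i - \hat{\theta})\, x_{if} \le BLIM \quad \forall i \notin Fix, f$   (d6)

$(b_i - \hat{\theta})\, x_{if} \ge -BLIM \quad \forall i \notin Fix, f$   (d7)

$\sum_{i} \sum_{\theta} INF_{i\theta}\, x_{i1} - \sum_{i} \sum_{\theta} INF_{i\theta}\, x_{if} = delplus_f - delneg_f \quad \forall f > 1$   (d8)

$0 \le py_{\theta fg} \le CAT_g \quad \forall \theta, f, g$   (d9)

$0 \le ny_{\theta fg} \le CAT_g \quad \forall \theta, f, g$   (d10)

$x_{if} = 1 \quad \forall i \in Fix, f$   (d11)

$x_{if} = xb_{if} + xc_{if} \quad \forall i, f$   (d12)

$\sum_{i} xb_{if} \ge \sum_{i \in Fix} x_{if} + 1 \quad \forall f$   (d13)

$0 \le x_{if} \le 1$   (d14)

$0 \le xc_{if} \le 1$   (d15)

$xb_{if}$ binary   (d16)

$delplus_f,\ delneg_f \ge 0 \quad \forall f$   (d17)
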

The second variation (SM) uses item stratification. It adds the constraint below,
adapted from Chang and van der Linden’s shadow test formulation with item
stratification, to the formulation.

$\sum_{i \in Q_r} x_{if} = S_r \quad \forall r, f$

In order to ensure that the decision variable for an item from the current stage is binary,
the formulation sets all of the items in the shadow test at the current stage as binary. The
constraint below achieves this purpose.

$\sum_{i \in Q_r} xb_{if} = S_r \quad \forall r = CURSTG, f$

where CURSTG is the current stage of the exam.

The third variation (SDM) combines the DM and SM formulations. However,
instead of adding the two constraints to limit the difficulty parameter, the formulation
relaxes the two constraints and inserts them into the objective function as a price for
deviating too far from the current ability estimate. The new objective function is
therefore

min  $\sum_{\theta} \sum_{f} \sum_{g} PENALTY_g\,(py_{\theta fg} + ny_{\theta fg}) + PARAWEI \sum_{f>1} (delplus_f - delneg_f) + DIFPEN \sum_{i} \sum_{f} (pbdev_{if} + nbdev_{if})$

where $pbdev_{if}$ and $nbdev_{if}$ are given below

$(b_i - \hat{\theta})\, x_{if} \le BLIM + pbdev_{if} \quad \forall i \notin Fix, f$

$(b_i - \hat{\theta})\, x_{if} \ge -BLIM - nbdev_{if} \quad \forall i \notin Fix, f,$

and DIFPEN is the penalty per unit for more than BLIM units over or under the current
ability estimate. The difficulty constraints are not added directly to the
formulation because, combined with the item stratification constraints, they
tend to make the problem infeasible. Below is the
SDM formulation.
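
The effect of the relaxed difficulty constraints on a single selected item can be sketched as below. The sketch assumes the item is selected (its decision variable equals one) and that the deviation variables are nonnegative, so that under minimization each takes the smallest value satisfying its constraint; the function name and the numeric values are illustrative only.

```python
def difficulty_penalty(b, theta_hat, blim, difpen):
    """Return (pbdev, nbdev, objective contribution) for one selected item.

    pbdev is the amount by which b exceeds theta_hat + blim, and nbdev is the
    amount by which b falls below theta_hat - blim. At most one is positive.
    """
    dev = b - theta_hat
    pbdev = max(0.0, dev - blim)
    nbdev = max(0.0, -dev - blim)
    return pbdev, nbdev, difpen * (pbdev + nbdev)


# An item 1.2 units harder than the estimate, with BLIM = 0.5 and a
# hypothetical DIFPEN of 10, is charged for the 0.7 units beyond the limit.
pbdev, nbdev, cost = difficulty_penalty(1.2, 0.0, 0.5, 10.0)
```

Items within BLIM of the estimate incur no charge, so the penalty steers selection toward them without ruling out other items when stratification leaves no alternative.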

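Data:

CURSTG         Current stage of exam

Variables:

$pbdev_{if}$   The additional positive deviation of item i’s difficulty parameter from the
               current ability estimate greater than BLIM

$nbdev_{if}$   The additional negative deviation of item i’s difficulty parameter from the
               current ability estimate less than BLIM

Formulation:

min  $\sum_{\theta} \sum_{f} \sum_{g} PENALTY_g\,(py_{\theta fg} + ny_{\theta fg}) + PARAWEI \sum_{f>1} (delplus_f - delneg_f) + DIFPEN \sum_{i} \sum_{f} (pbdev_{if} + nbdev_{if})$   (sd1)

s.t.

$\sum_{g} py_{\theta fg} \ge \sum_{i} INF_{i\theta}\, x_{if} - shape_{\theta} \quad \forall \theta, f$   (sd2)

$\sum_{g} ny_{\theta fg} \ge -\sum_{i} INF_{i\theta}\, x_{if} + shape_{\theta} \quad \forall \theta, f$   (sd3)

$\sum_{i \in TaxItem_t} x_{if} = NITEM_t \quad \forall f, t$   (sd4)

[…]   (sd5)

$(b_i - \hat{\theta})\, x_{if} \le BLIM + pbdev_{if} \quad \forall i \notin Fix, f$   (sd6)

$(b_i - \hat{\theta})\, x_{if} \ge -BLIM - nbdev_{if} \quad \forall i \notin Fix, f$   (sd7)

$\sum_{i} \sum_{\theta} INF_{i\theta}\, x_{i1} - \sum_{i} \sum_{\theta} INF_{i\theta}\, x_{if} = delplus_f - delneg_f \quad \forall f > 1$   (sd8)

$0 \le py_{\theta fg} \le CAT_g \quad \forall \theta, f, g$   (sd9)

$0 \le ny_{\theta fg} \le CAT_g \quad \forall \theta, f, g$   (sd10)

$x_{if} = 1 \quad \forall i \in Fix, f$   (sd11)

$x_{if} = xb_{if} + xc_{if}$   (sd12)

$\sum_{i \in Q_r} xb_{if} = S_r \quad \forall r = CURSTG, f$   (sd13)

$\sum_{i \in Q_r} x_{if} = S_r \quad \forall r, f$   (sd14)

$0 \le x_{if} \le 1$   (sd15)

$0 \le xc_{if} \le 1$   (sd16)

$xb_{if}$ binary   (sd17)

$delplus_f,\ delneg_f \ge 0 \quad \forall f$   (sd18)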

B. ABILITY CALCULATION

The Owens Bayes algorithm [Sands, Waters, and McBride 1999], which the
CAT-ASVAB normally uses to calculate the ability after an examinee answers each item,
assumes that if an examinee answers an item correctly, he receives a more difficult item
next, and if he answers incorrectly, he receives an easier item [Krass 2005]. Because
none of the shadow test variations above consistently follows this behavior, this thesis
uses a different algorithm to estimate the ability after an examinee answers each item.
This algorithm, developed by Dan Segall of DMDC, unlike the Owens Bayes algorithm,
is independent of the order in which the test administers the items and of whether or not the test
administers an item of higher difficulty to an examinee after a correct answer [Krass
2005]. Its calculation time is longer than that of the Owens Bayes algorithm, but it is still within
30 seconds, which is our criterion for an acceptable solution time for a CAT [Krass
2005].

IV. RESULTS OF CAT-ASVAB OPTIMIZATION SIMULATIONS


A.  SETUP  FOR  SIMULATION 

To test the performance of the model, we run simulations for each shadow test
variation. GAMS [GAMS 2006] generates all integer linear programs (ILP) and XA
[Sunset 2003] solves them on a 1.7 GHz Dell workstation. We use an approach similar to
that of Chang and van der Linden’s paper on item stratification and select a few ability levels for
the simulations. Those ability levels are $\theta$ = -1.5, -1.0, -0.5, 0, 0.5, 1.0, and 1.5. For
each of these ability levels, the simulation creates 500 examinees. Each examinee takes a
test generated by each of the five variations. The first is the current implementation of
the CAT-ASVAB, which administers items by maximum information (OM). This is the
benchmark for comparing the other four variations. The other four variations are shadow tests that
come from a CAT-ASVAB optimization formulation: the variation derived from Kunde’s
paper and pencil formulation adapted for the CAT (KM), the variation with constraints on
the difficulty parameters (DM), the variation using item stratification (SM), and the
variation with item stratification and difficulty parameter constraints (SDM).
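
The examinee design described above can be sketched as follows. The response model is not spelled out in this section, so the sketch assumes a three-parameter logistic (3PL) model with scaling factor D = 1.7; the function names and item parameters are hypothetical.

```python
import math
import random

ABILITY_LEVELS = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
EXAMINEES_PER_LEVEL = 500
D = 1.7  # scaling factor from Table 1


def make_examinees():
    """Return the true ability of every simulated examinee (7 x 500)."""
    return [theta for theta in ABILITY_LEVELS
            for _ in range(EXAMINEES_PER_LEVEL)]


def simulate_response(theta, a, b, c, rng):
    """Draw a correct (True) or incorrect (False) answer under an assumed 3PL."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return rng.random() < p


examinees = make_examinees()
rng = random.Random(2006)  # fixed seed so a run can be repeated
first_answer = simulate_response(examinees[0], a=1.0, b=0.0, c=0.2, rng=rng)
```

Using the same examinee list for every variation keeps the comparison against OM paired by examinee.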

Discretization of ability levels provides information only for the selected values of $\theta$,
but we have high confidence in the results for those ability levels. This discretization also
corresponds to an underlying assumption that examinee ability levels follow a uniform
distribution. An alternative strategy would be to sample from a continuous distribution
(for example, the standard normal). Previous CAT research has observed that sampling
from a continuous distribution of $\theta$ would imply using enormous sample sizes to get
reasonable estimates of the bias and mean squared error (MSE) functions, which still
would have to be pooled over classes of $\theta$ values and be accurate only near the center of
the distribution [Chang and van der Linden, 2003]. There are two consequences from
this assumption. “First, the results for the bias and MSE functions are conditional on $\theta$
[Chang and van der Linden, 2003].” But, because the accuracy of these functions is not
dependent on the distribution of the examinees, one can generalize the results for the bias
and MSE to any population of examinees. “Second, the results for the item exposure
rates do not necessarily generalize to other populations of examinees [Chang and van der
Linden, 2003].”

The item pool contains approximately 170 items and comes from the
Mathematical Knowledge test for the CAT-ASVAB [Sands, Waters, and McBride 1999].
These items are an experimental set and are not an actual item pool currently in use for
the CAT-ASVAB. Each shadow test variation has about 2,500 constraints, 350 binary
variables, and 2,000 continuous variables.

The initial ability estimate for each test variation is $\theta$ = 0. After the simulated
examinees take the tests, the simulation outputs a set of deviations between the true and
estimated $\theta$ for each examinee. Then, using S-Plus 6.2 [Insightful 2003], we run a
Wilcoxon Signed-Rank Test to compare each shadow test variation’s deviation distribution
to that of OM [e.g. Conover 1999]. Table 1 gives the parameters used for the shadow tests.


Parameter                                                              Value

For all Shadow Test Variations
  Forms per Shadow Test                                                2
  Number of Items per Form                                             15
  Scaling Factor (D)                                                   1.7
  Number of items required from taxonomy group 1 ($NITEM_1$)           2
  Number of items required from taxonomy group 2 ($NITEM_2$)           4
  Number of items required from taxonomy group 3 ($NITEM_3$)           8
  Number of items required from taxonomy group 4 ($NITEM_4$)           1

For DM and SDM
  Maximum allowable deviation of difficulty from current
  ability (bLimit)                                                     0.5

For SM and SDM
  Number of items required from stratum 1 ($S_1$)                      3
  Number of items required from stratum 2 ($S_2$)                      4
  Number of items required from stratum 3 ($S_3$)                      4
  Number of items required from stratum 4 ($S_4$)                      4

Repetitions (or number of examinees)                                   500

Table 1: Parameter Settings for Formulations
There are five variations altogether (the four shadow test variations and
OM) with 3,500 repetitions for each (500 repetitions for each of the seven given ability
levels).

B. RESULTS

Table 2 shows the taxonomy distribution for the simulations. The simulation
altogether selects 52,500 items (15 items for each of the 3,500 tests) for each test
variation. OM performs poorly in terms of the taxonomy constraints specified. A
majority of items administered in the OM simulation come from taxonomy group 3. This
is most likely because in the item pool, 103 of the 170 items are in taxonomy group 3.
On the other hand, the four shadow test variations follow the taxonomy constraints shown
in Table 1.


Taxonomy Group     KM, DM, SM, and SDM     OM

1                  7000                    2858
2                  14000                   6028
3                  28000                   40150
4                  3500                    3464

Table 2: Taxonomy Distribution
Taxonomy distribution for OM heavily favors taxonomy group 3, while
the taxonomy distribution for KM, DM, SM, and SDM follows the
parameters set by the simulation (shown in Table 1).
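
The shadow test column of Table 2 follows directly from the taxonomy requirements in Table 1 and the number of simulated tests. The short check below reproduces that arithmetic and compares the OM share of taxonomy group 3 with the group's share of the item pool; the variable names are illustrative.

```python
NITEM = {1: 2, 2: 4, 3: 8, 4: 1}   # items required per taxonomy group (Table 1)
TESTS = 7 * 500                    # seven ability levels, 500 examinees each

# Items administered per group when every test meets the taxonomy constraints.
expected = {group: n * TESTS for group, n in NITEM.items()}

om_counts = {1: 2858, 2: 6028, 3: 40150, 4: 3464}   # OM column of Table 2
om_share_group3 = om_counts[3] / sum(om_counts.values())
pool_share_group3 = 103 / 170

print(expected)                      # {1: 7000, 2: 14000, 3: 28000, 4: 3500}
print(round(om_share_group3, 3))     # 0.765
print(round(pool_share_group3, 3))   # 0.606
```

OM draws roughly 76 percent of its items from group 3, which is more than the group's 61 percent share of the pool and well above the 8 of 15 items (about 53 percent) that the taxonomy constraints prescribe.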


Table 3 shows the solution times of each shadow test variation. The times include
the program generation, runtime, and output time for GAMS. KM and DM have
acceptable results with maximum solution times under 10 seconds. The item
stratification variations, SM and SDM, however, have higher maximum solution times.
The long solution times occurred primarily at the selection of the 12th item, which is the
beginning of the 4th and final stage. With the exception of that item, solution times are as
quick as those of the other variations for the selection of the rest of the items in the test. If
needed, the maximum solution times could possibly be reduced by using direct problem
generation or another solver, but we do not explore these options in this thesis.


Solution Time (seconds)

Shadow Test Variation     Max         Min      Average

KM                        7.731       0.24     0.472
DM                        3.245       0.27     0.522
SM                        1036.66     0.27     1.865
SDM                       189.012     0.34     3.924

Table 3: Solution Times
The solution time for KM, DM, SM, and SDM, on average, is acceptable.
But the high maximum solution times for SM and SDM make them
impractical options.


Figure 2 shows the exposure rates of the items for each variation. They are
calculated by dividing the number of times the item is administered by the number of
tests. The x-axis lists the items in descending order according to their exposure rates.
Although SM and SDM start off much higher, all of the shadow test variations eventually
end up approaching a more uniform distribution than OM. OM has the largest number of
unused items at 77 items. SDM and SM have the next highest numbers of unused items at
37 and 34 items, respectively. Of even more concern, however, are the extremely high
exposure rates, with SM carrying a maximum exposure rate of 1 and SDM carrying a
maximum exposure rate of 0.86. The problem items, although different for each
variation, are concentrated at the start of the exam. A possible reason for this is that items
at the beginning of the test have a lower discrimination, so their Sympson and Hetter
parameters are very high (close to or equal to 1), making the test much less likely to
reject the items. Therefore, the Sympson and Hetter algorithm would rarely reject an
item at the first stage. KM and DM administered all of the items in their simulations. As
the graph shows, the curves for KM and DM have the flattest slopes, which indicate low
maximum exposure rates and a low number of unutilized items.
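
The exposure rate calculation described above can be sketched in a few lines. The function name and the toy data are hypothetical; the sketch assumes each test administers a given item at most once.

```python
from collections import Counter


def exposure_rates(administered_tests, pool):
    """Exposure rate per item: times administered divided by number of tests.

    Returns a dict ordered from the highest to the lowest rate, matching the
    descending order used on the x-axis of the exposure plot.
    """
    counts = Counter(item for test in administered_tests for item in test)
    n_tests = len(administered_tests)
    rates = {item: counts[item] / n_tests for item in pool}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


tests = [["a", "b"], ["a", "c"]]                 # two toy tests of two items
rates = exposure_rates(tests, pool=["a", "b", "c", "d"])
unused = [item for item, r in rates.items() if r == 0.0]
```

Here item "a" appears in every test and has an exposure rate of 1, while item "d" is never administered and counts as unused.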


Figure 2: Exposure Rates
OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line. The x-axis lists the items in descending order
according to their exposure rates.

Figures 3-7 below are the histograms of the errors for each test variation. The
error for each examinee’s estimated ability is:

$\hat{\theta}_k - \theta_k$

where $\hat{\theta}_k$ is the estimated ability level of examinee k after the exam, and $\theta_k$ is examinee
k’s true ability level. There are 3,500 examinees for each test variation (500 examinees
for each of the seven pre-selected ability levels). The Wilcoxon Signed-Rank Test p-values
are given in Table 4. For this simulation, we use a two-sided test to determine whether
there is a difference between the means and medians of each shadow test variation’s
deviation distribution and those of OM. At the 0.05 significance level, a p-value
under 0.05 would indicate a significant difference between the means and medians of a
given formulation and those of OM. The p-values for DM and SDM are reported as zero.
Therefore, DM and SDM differ significantly from OM.
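
For readers who want to reproduce the style of comparison, the sketch below computes a two-sided Wilcoxon signed-rank p-value with the large-sample normal approximation. It is not the S-Plus routine used for the reported values, which may apply exact, continuity, or tie corrections; the pairing of differences by examinee is also an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(diffs):
    """Two-sided p-value from paired differences (normal approximation).

    Requires at least one nonzero difference.
    """
    d = [v for v in diffs if v != 0]          # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1                 # average rank for tied values
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Differences that are symmetric about zero give no evidence of a shift.
p_symmetric = wilcoxon_signed_rank([1, -1, 2, -2])
# Differences that are all positive indicate a clear shift.
p_shifted = wilcoxon_signed_rank(list(range(1, 21)))
```

In the setting of this section, each difference would be one examinee's error under a shadow test variation minus the same examinee's error under OM.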


p-values overall

KM         DM      SM         SDM

0.1417     0       0.2489     0

Table 4: p-values versus OM for Wilcoxon Signed-Rank Test
DM and SDM significantly differ from OM because their p-values are
below 0.05.


Figure 3: Error Histogram of OM
The x-axis gives the error range for $\theta$ (given by $\hat{\theta}_k - \theta_k$); the y-axis gives
the frequency of the errors.

Figure 4: Error Histogram of KM
The x-axis gives the error range for $\theta$ (given by $\hat{\theta}_k - \theta_k$); the y-axis gives
the frequency of the errors.

Figure 5: Error Histogram of DM
The x-axis gives the error range for $\theta$ (given by $\hat{\theta}_k - \theta_k$); the y-axis gives
the frequency of the errors.

Figure 6: Error Histogram of SM
The x-axis gives the error range for $\theta$ (given by $\hat{\theta}_k - \theta_k$); the y-axis gives
the frequency of the errors.

Figure 7: Error Histogram of SDM
The x-axis gives the error range for $\theta$ (given by $\hat{\theta}_k - \theta_k$); the y-axis gives
the frequency of the errors.


Figures 8 and 9 below show the bias and mean squared error (MSE) functions.
The values in the graphs are discrete, with polynomial interpolation (from MS Excel) to
obtain the intermediate values. In terms of the bias functions, each test variation
performs similarly, with a large bias for more extreme negative ability levels. The graphs
are also consistent with the results from the Wilcoxon Signed-Rank Test. The KM and SM
curves have steep slopes like the OM curve at the extreme negative values of $\theta$. OM
performs better than KM and SM for most of the curve, and performs better than all of
the shadow test variations at $\theta$ > 0.5. This is not surprising as there are no taxonomy
constraints on OM. The two variations that were shown to be significantly different from
OM, DM and SDM, have a flatter slope and do not have the steep negative slope at the
extreme negative ability levels. Of particular note, DM performs better than OM for
most of the curve at $\theta$ < 0.5. Also, with the exception of $\theta$ = -0.5, where the magnitude of
the bias is only slightly higher than that of OM, SDM performs better than OM in the
same regions as DM.

Because the bias functions for each variation behave similarly, it is not surprising
that the MSE functions for each variation do as well, with large errors as $\theta$ approaches the
extreme negative values. OM performs the best for most of the curve, $\theta$ > 0, and
performs better than KM and SM for all values of $\theta$. Like the bias curves, the MSE curves
for DM and SDM are flatter than that of OM, and therefore perform better at extreme negative
values of $\theta$, with DM’s MSE lower than SDM’s MSE for the whole curve.
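
The bias and MSE values at each ability level can be computed from the simulation output as sketched below. The sketch takes the error to be the estimated ability minus the true ability, as defined earlier in this chapter; the function name and the toy results are hypothetical.

```python
from collections import defaultdict


def bias_and_mse(results):
    """Map each true ability level to (bias, MSE).

    results : iterable of (true_theta, estimated_theta) pairs
    bias    : mean of (estimate - truth) at that level
    MSE     : mean of (estimate - truth) squared at that level
    """
    errors = defaultdict(list)
    for true_theta, estimate in results:
        errors[true_theta].append(estimate - true_theta)
    return {
        theta: (sum(e) / len(e), sum(x * x for x in e) / len(e))
        for theta, e in sorted(errors.items())
    }


toy = [(0.0, 0.2), (0.0, -0.2), (1.0, 1.5), (1.0, 1.5)]
summary = bias_and_mse(toy)
```

In the toy data the estimates at level 0 are unbiased but noisy, while the estimates at level 1 are consistently too high, so a variation can have low bias and still have a sizeable MSE.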


Figure 8 (bias functions): OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line.

Figure 9 (MSE functions): OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line.

V. CONCLUSIONS AND FUTURE RESEARCH


A. CONCLUSIONS

The simulation results show that the current implementation of the CAT would
benefit from the use of shadow tests. The primary motivation behind using the shadow
tests for the CAT-ASVAB is to control taxonomy. This thesis introduces integer linear
program (ILP) formulations that achieve this objective, while our computational
experience shows that the current method of item selection for the CAT-ASVAB (OM)
has a taxonomy distribution that heavily favors one taxonomy group. In the area of item
exposure, there are also significant benefits over OM. There are fewer unutilized items
for each shadow test variation. In the case of the first and second shadow test variations
(KM and DM), all items are administered, and maximum exposure rates are also lower
than those of OM. The consequence of using the shadow test variations instead of OM is a slight
loss in precision. As stated in Chang and van der Linden’s paper, “the loss (in accuracy)
can be made up for by adding a few items to the test, whereas the loss in credibility for a
testing program due to item compromise or the financial loss involved in inefficient item
usage is much more difficult to compensate [Chang and van der Linden, 2003].”

Given the five metrics for the simulation (bias, mean squared error (MSE),
exposure rates, solution times, and taxonomy distribution), DM would be the most
recommended amongst the shadow test variations. Like the rest of the shadow test
variations, it meets the taxonomy constraints, and it has the lowest maximum solution
time, with an average solution time close to that of KM, the fastest on average.
It actually has a lower bias than OM for most of the curve. Finally, the mean
squared error (MSE) is the second lowest next to OM and is even lower than OM’s at the
negative values of $\theta$. On the other hand, because of the high maximum exposure rates
and maximum solution times, the shadow test variations with item stratification (SM and
SDM) would not be recommended, despite also having bias and MSE close to those of OM.

B. FUTURE RESEARCH

Because an experimental set of items comprises the item pool for this thesis
simulation, further research can use an existing or future item pool to execute the
formulations. Also, only data for the Mathematical Knowledge (MK) test is used.
Therefore item pools for the other CAT-ASVAB tests can be used in future research.
Another area that can be extended is the sampling of the examinees. One could use a
continuous distribution instead of sampling discrete values of $\theta$. Also, this thesis only
uses MSE and bias, whereas the current CAT-ASVAB uses the Birnbaum Score
Information Function to measure precision of the exam [Sands, Waters, and McBride
1999]. Therefore, future research can also use this function.